Switch to (Re)TestItems #262
    We can guarantee these test images will always be available, which is not the case for the current sample image.
Still needs a bit of work for GPU CI on 1.6 and nightly (possibly disabling the latter for now), but this is mostly good to go. Some timings:

It appears we spend a lot of time compiling, as evidenced by the large time savings when similar models are run one after another. ViTs are an outlier despite their relative runtime slowness because they use the (type unstable under AD) Vector
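To illustrate the type-stability point, here is a minimal sketch in plain Julia. It does not use Flux's actual `Chain` implementation. `Scale`, `Shift`, and `applychain` are hypothetical stand-ins for layers and for chain application, chosen only to show how storing layers in a `Vector` rather than a `Tuple` hides their concrete types from the compiler.

```julia
# Hypothetical stand-ins for layers: two callables with different concrete types.
struct Scale
    w::Float64
end
(s::Scale)(x) = s.w * x

struct Shift
    b::Float64
end
(s::Shift)(x) = x + s.b

# Tuple-backed chain: the layer types are part of the tuple's type,
# so the compiler can follow each call by recursing over the tuple.
applychain(::Tuple{}, x) = x
applychain(layers::Tuple, x) = applychain(Base.tail(layers), first(layers)(x))

# Vector-backed chain: the element type is abstract (here `Any`),
# so each call to `layer(x)` is dispatched at runtime.
function applychain(layers::AbstractVector, x)
    for layer in layers
        x = layer(x)
    end
    return x
end

tuple_layers  = (Scale(2.0), Shift(1.0))
vector_layers = [Scale(2.0), Shift(1.0)]   # a Vector{Any}: the two structs share no concrete supertype

# Both containers compute the same value.
@assert applychain(tuple_layers, 3.0) == applychain(vector_layers, 3.0)

# Ask the compiler what return type it infers for each container.
tuple_ret  = only(Base.return_types(applychain, (typeof(tuple_layers), Float64)))
vector_ret = only(Base.return_types(applychain, (typeof(vector_layers), Float64)))
println("tuple:  ", tuple_ret)    # a concrete type
println("vector: ", vector_ret)   # an abstract type
```

The vector form has less to specialize on, which would be consistent with it compiling faster while running slower, but this toy says nothing about the AD behaviour or the training instabilities discussed in this thread.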
During my GSoC, we explored this, and I noticed that when training, the Vector Chain gave me extremely bumpy loss curves, which is one of the reasons we removed them going from 0.7 to 0.8. A lot of this can come back slowly if we train more to isolate the exact problem, I think.
With the renewed interest in #198 (comment), now may be the time to revisit what's causing these mysterious instabilities during training. Shall we continue the discussion there?
Co-authored-by: Kyle Daruwalla <[email protected]>
    `reclaim` to load the CUDA driver and fails otherwise
50% per worker so we avoid
Ok, Buildkite is happy and so am I. This should be good to go. We should now have a pretty good picture of what works and what doesn't on GPU too!
The impetus for this PR was twofold:
Along the way, I found some additional changes which could either be tackled here or in a follow-up PR:
WideResNet on GHA (fixed)

My feeling is that we'd want to set aside a subset of faster tests for 1.6/nightly/GPU CI. Maybe the smallest variant of each model. Then we can decrease our overall runtime while expanding our version matrix to cover everything we probably should've been covering.
PR Checklist